AITopics | n-gram statistics

Collaborating Authors

n-gram statistics

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Understanding Transformers via N-Gram Statistics

Neural Information Processing SystemsMay-27-2025, 13:09:02 GMT

Transformer based large-language models (LLMs) display extreme proficiency with language yet a precise understanding of how they work remains elusive. One way of demystifying transformer predictions would be to describe how they depend on their context in terms of simple template functions. This paper takes a first step in this direction by considering families of functions (i.e. By studying how well these rulesets approximate transformer predictions, we obtain a variety of novel discoveries: a simple method to detect overfitting during training without using a holdout set, a quantitative measure of how transformers progress from learning simple to more complex statistical rules over the course of training, a model-variance criterion governing when transformer predictions tend to be described by N-gram rules, and insights into how well transformers can be approximated by N-gram rulesets in the limit where these rulesets become increasingly complex. In this latter direction, we find that for 79% and 68% of LLM next-token distributions on TinyStories and Wikipedia, respectively, their top-1 predictions agree with those provided by our N-gram rulesets.

n-gram ruleset, n-gram statistics, transformer prediction

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Efficient Training of Language Models with Compact and Consistent Next Token Distributions

Sathe, Ashutosh, Sarawagi, Sunita

arXiv.org Artificial IntelligenceJul-3-2024

Maximizing the likelihood of the next token is an established, statistically sound objective for pre-training language models. In this paper we show that we can train better models faster by pre-aggregating the corpus with a collapsed $n$-gram distribution. Previous studies have proposed corpus-level $n$-gram statistics as a regularizer; however, the construction and querying of such $n$-grams, if done naively, prove to be costly and significantly impede training speed, thereby limiting their application in modern large language model pre-training. We introduce an alternative compact representation of the next token distribution that, in expectation, aligns with the complete $n$-gram distribution while markedly reducing variance across mini-batches compared to the standard next-token loss. Empirically, we demonstrate that both the $n$-gram regularized model and our approximation yield substantial improvements in model quality and convergence rate compared to existing methods. Furthermore, our approximation facilitates scalability of gains to larger datasets and models compared to the straightforward $n$-gram regularization method.

cocont, dataset, language model, (16 more...)

arXiv.org Artificial Intelligence

2407.02819

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > Texas > Travis County > Austin (0.04)
(9 more...)

Genre: Research Report (1.00)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.94)

Add feedback

Understanding Transformers via N-gram Statistics

Nguyen, Timothy

arXiv.org Artificial IntelligenceJun-30-2024

Transformer based large-language models (LLMs) display extreme proficiency with language yet a precise understanding of how they work remains elusive. One way of demystifying transformer predictions would be to describe how they depend on their context in terms of simple template functions. This paper takes a first step in this direction by considering families of functions (i.e. rules) formed out of simple N-gram based statistics of the training data. By studying how well these rulesets approximate transformer predictions, we obtain a variety of novel discoveries: a simple method to detect overfitting during training without using a holdout set, a quantitative measure of how transformers progress from learning simple to more complex statistical rules over the course of training, a model-variance criterion governing when transformer predictions tend to be described by N-gram rules, and insights into how well transformers can be approximated by N-gram rulesets in the limit where these rulesets become increasingly complex. In this latter direction, we find that for 78% of LLM next-token distributions on TinyStories, their top-1 predictions agree with those provided by our N-gram rulesets.

optimal rule, prediction, statistics, (14 more...)

arXiv.org Artificial Intelligence

2407.12034

Country:

South America > Guyana (0.04)
South America > Colombia > Meta Department > Villavicencio (0.04)
North America > United States > Ohio > Franklin County > Columbus (0.04)
(3 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

CMU's ASR2K Pipeline Recognizes Speech in 1909 Languages Without Audio

#artificialintelligenceSep-15-2022, 05:03:58 GMT

AI-powered speech recognition systems have made great progress in recent years, with speech-to-text processing now so powerful that the occasional errors are little more than curious exceptions. Most contemporary models addressing this task however require massive labelled training data -- which is simple enough to source for English, Chinese, and other popular languages but challenging to obtain in the case of the low-resource tongues that make up the majority of the world's 8,000 languages. To address this issue, a Carnegie Mellon University research team has developed a speech recognition pipeline that can recognize 1909 languages without any audio for the target language. Their ASR2K pipeline achieves impressive 45 percent CER and 69 percent WER scores when using 10,000 raw text utterances on the CMU Wilderness dataset, and is introduced in the paper ASR2K: Speech Recognition for Around 2000 Languages Without Audio. The proposed pipeline comprises separate acoustic, pronunciation, and language models.

asr2k pipeline recognize speech, language model, pronunciation model, (9 more...)

#artificialintelligence

Country: Asia > South Korea > Incheon > Incheon (0.06)

Technology: Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)

Add feedback

ASR2K: Speech Recognition for Around 2000 Languages without Audio

Li, Xinjian, Metze, Florian, Mortensen, David R, Black, Alan W, Watanabe, Shinji

arXiv.org Artificial IntelligenceSep-6-2022

Most recent speech recognition models rely on large supervised datasets, which are unavailable for many low-resource languages. In this work, we present a speech recognition pipeline that does not require any audio for the target language. The only assumption is that we have access to raw text datasets or a set of n-gram statistics. Our speech pipeline consists of three components: acoustic, pronunciation, and language models. Unlike the standard pipeline, our acoustic and pronunciation models use multilingual models without any supervision. The language model is built using n-gram statistics or the raw text dataset. We build speech recognition for 1909 languages by combining it with Crubadan: a large endangered languages n-gram database. Furthermore, we test our approach on 129 languages across two datasets: Common Voice and CMU Wilderness dataset. We achieve 50% CER and 74% WER on the Wilderness dataset with Crubadan statistics only and improve them to 45% CER and 69% WER when using 10000 raw text utterances.

dataset, pronunciation model, recognition, (15 more...)

arXiv.org Artificial Intelligence

2209.02842

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Africa > Sub-Saharan Africa (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)

Add feedback

HyperSeed: Unsupervised Learning with Vector Symbolic Architectures

Osipov, Evgeny, Kahawala, Sachin, Haputhanthri, Dilantha, Kempitiya, Thimal, De Silva, Daswin, Alahakoon, Damminda, Kleyko, Denis

arXiv.org Artificial IntelligenceOct-15-2021

Across all experiments, Hyperseed convincingly machine learning and robotics context is currently gaining a demonstrates its key novelties of learning from a few input great momentum [1]-[6]. In classification tasks, the use of vectors and single vector operation learning rule, both of which VSA leads to order of magnitude increase in energy efficiency contribute towards reduced time and computation complexity. of computations on the one hand and natively enables oneshot The paper is structured as follows. Section II describes and multitask learning on the other [7]. It is prospected the related work relevant to Hyperseed operations. The used that VSA will play a key role in the development of novel methods including the fundamentals of VSA are presented neuromorphic computer architectures [8] as an algorithmic in Section III. Section IV presents the main contribution - abstraction [9], [10]. The main contribution of this paper is the method for unsupervised learning Hyperseed. Section V a novel algorithm for unsupervised learning called Hyperseed, reports the results of the performance evaluation the experiments.

accuracy, hyperseed, hypervector, (15 more...)

arXiv.org Artificial Intelligence

2110.08343

Country:

North America > United States > California > Alameda County > Berkeley (0.14)
North America > Canada > Ontario > Toronto (0.14)
Europe > Sweden > Norrbotten County > Luleå (0.04)
(3 more...)

Genre: Research Report > New Finding (0.93)

Industry:

Information Technology (0.46)
Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.95)

Add feedback

Detecting and Exorcising Statistical Demons from Language Models with Anti-Models of Negative Data

Wick, Michael L., Silverstein, Kate, Tristan, Jean-Baptiste, Pocock, Adam, Johnson, Mark

arXiv.org Artificial IntelligenceOct-22-2020

It's been said that "Language Models are Unsupervised Multitask Learners." Indeed, self-supervised language models trained on "positive" examples of English text generalize in desirable ways to many natural language tasks. But if such models can stray so far from an initial self-supervision objective, a wayward model might generalize in undesirable ways too, say to nonsensical "negative" examples of unnatural language. A key question in this work is: do language models trained on (positive) training data also generalize to (negative) test data? We use this question as a contrivance to assess the extent to which language models learn undesirable properties of text, such as n-grams, that might interfere with the learning of more desirable properties of text, such as syntax. We find that within a model family, as the number of parameters, training epochs, and data set size increase, so does a model's ability to generalize to negative n-gram data, indicating standard self-supervision generalizes too far. We propose a form of inductive bias that attenuates such undesirable signals with negative data distributions automatically learned from positive data. We apply the method to remove n-gram signals from LSTMs and find that doing so causes them to favor syntactic signals, as demonstrated by large error reductions (up to 46% on the hardest cases) on a syntactic subject-verb agreement task.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2010.11855

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Oceania > Australia > Victoria > Melbourne (0.04)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
(4 more...)

Genre: Research Report (0.82)

Industry: Banking & Finance > Trading (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback